Here we will work with fungal ITS amplicon data processed using similar methods to those described in the first part of the workshop. These data were obtained from soil and root samples collected from plots receiving different fertiliser treatments. We also have additional metadata for these samples, in the form of several chemical and biological parameters, that we will link to the amplicon data and use in our analyses.

Here, I will use base-R and tidyverse functions to show you how you can do this. There are specialised packages for, for example, reading and manipulating data in biom format, but it is simple enough to use approaches that are generally applicable to reading and manipulating data. I prefer to show it this way because it is very flexible and emphasises the importance of a well-structured and organised workflow. The example used here demonstrates the value of this flexibility, since the sequencing data and metadata were collected at different hierarchical levels. However, we will use some convenient functions from the phyloseq library to inspect and subset/prune the data.

Most of the libraries used here can be installed from CRAN using, for example:

install.packages('vegan')

Some packages are only available on GitHub; these can be installed as follows:

install.packages('devtools')
devtools::install_github('joey711/phyloseq')
devtools::install_github('cpauvert/psadd')
devtools::install_github('jfq3/QsRutils')
devtools::install_github('jfq3/ggordiplots')
devtools::install_github('pmartinezarbizu/pairwiseAdonis/pairwiseAdonis')

Step 1: Read in and manipulate the OTU tables

Our OTU data are stored in a tab-delimited text file, generated using biom convert on a .biom file. The first few rows are shown in the screengrab below.

The first row contains information that is not useful for us, so we will skip reading it.

# read OTU data, calling this 'dat' 
# use the 'skip' argument to not read in the first row
# use the 'stringsAsFactors' argument to read OTU names as character
dat <- read.delim('data/nutnet_ITS.txt', skip=1, stringsAsFactors=F)

# inspect the structure of the data
str(dat)
## 'data.frame':    3504 obs. of  33 variables:
##  $ X.OTU.ID          : chr  "ITSall_OTUa_524" "ITSall_OTUa_78" "ITSall_OTUa_8945" "ITSall_OTUc_1" ...
##  $ barcodelabel.JP10.: int  4990 803 59 2971 1262 441 1532 167 1 107 ...
##  $ barcodelabel.JP11.: int  176 487 37 8992 282 721 3307 561 59 42 ...
##  $ barcodelabel.JP12.: int  30 18 26 116 80 235 158 47 18 31 ...
##  $ barcodelabel.JP13.: int  89 50 51 1553 109 289 2087 578 17 56 ...
##  $ barcodelabel.JP14.: int  11 0 371 1781 894 1451 5140 398 9 672 ...
##  $ barcodelabel.JP15.: int  33 398 9 1678 333 1010 2455 532 214 159 ...
##  $ barcodelabel.JP16.: int  405 113 79 2482 1197 325 2406 155 872 8302 ...
##  $ barcodelabel.JP17.: int  10358 7662 1 158 1130 1347 843 3643 3885 8 ...
##  $ barcodelabel.JP18.: int  10611 2001 323 42 262 1910 133 705 4008 917 ...
##  $ barcodelabel.JP19.: int  112 188 500 112 990 462 230 152 210 308 ...
##  $ barcodelabel.JP1. : int  170 7 2 91 2 35 71 9 5 3 ...
##  $ barcodelabel.JP20.: int  499 2429 3081 382 2943 5285 2529 6200 3498 373 ...
##  $ barcodelabel.JP21.: int  12338 7136 8152 721 1346 6263 1849 1763 2123 8 ...
##  $ barcodelabel.JP22.: int  5359 5982 10495 139 3812 2398 923 490 6222 2232 ...
##  $ barcodelabel.JP23.: int  4598 9549 9407 114 1627 3099 419 1072 464 280 ...
##  $ barcodelabel.JP24.: int  15763 1034 581 142 3080 2609 387 464 785 210 ...
##  $ barcodelabel.JP25.: int  7941 5045 1647 469 4497 1631 899 495 113 399 ...
##  $ barcodelabel.JP26.: int  7059 8749 2077 240 7373 829 192 1609 133 34 ...
##  $ barcodelabel.JP27.: int  10335 7932 4564 440 4093 4114 320 3733 720 2054 ...
##  $ barcodelabel.JP28.: int  1056 2686 5734 116 5478 2293 85 503 361 9062 ...
##  $ barcodelabel.JP29.: int  745 2204 441 212 374 2140 590 8724 70 0 ...
##  $ barcodelabel.JP2. : int  10 1091 37 2040 0 114 1051 1 1477 110 ...
##  $ barcodelabel.JP30.: int  1578 784 8096 283 1624 1534 805 1137 137 207 ...
##  $ barcodelabel.JP31.: int  55 40 176 3 142 189 13 120 23 6 ...
##  $ barcodelabel.JP32.: int  4849 392 4820 205 2977 1099 243 1292 558 360 ...
##  $ barcodelabel.JP3. : int  0 1097 57 4124 38 858 1368 32 152 657 ...
##  $ barcodelabel.JP4. : int  683 594 79 4521 54 512 3353 840 2989 102 ...
##  $ barcodelabel.JP5. : int  1172 101 71 5285 412 628 3077 10 804 0 ...
##  $ barcodelabel.JP6. : int  20 349 626 4882 172 904 1911 500 785 618 ...
##  $ barcodelabel.JP7. : int  33 248 1097 2828 75 833 2387 37 2 260 ...
##  $ barcodelabel.JP8. : int  2701 329 459 3575 1471 955 2877 571 127 173 ...
##  $ barcodelabel.JP9. : int  44 19 35 407 0 137 303 118 17 66 ...

First we will tidy up the names of our columns, since there is extra text in the sample names introduced during sequence processing (these samples were labelled JP# when submitted for sequencing). Then we will transpose the data so that samples are in rows and OTUs are in columns.

# remove the extra text around the sample labels ('JP#') using 'gsub'
# arguments are 1: pattern to find, 2: replacement, 3: where to look
# 'fixed=T' is needed because, without it, '.' is treated as a regular-expression
# wildcard that matches any character, rather than as a literal full stop
names(dat) <- gsub('barcodelabel.', '', names(dat), fixed=T)
names(dat)
##  [1] "X.OTU.ID" "JP10."    "JP11."    "JP12."    "JP13."    "JP14."   
##  [7] "JP15."    "JP16."    "JP17."    "JP18."    "JP19."    "JP1."    
## [13] "JP20."    "JP21."    "JP22."    "JP23."    "JP24."    "JP25."   
## [19] "JP26."    "JP27."    "JP28."    "JP29."    "JP2."     "JP30."   
## [25] "JP31."    "JP32."    "JP3."     "JP4."     "JP5."     "JP6."    
## [31] "JP7."     "JP8."     "JP9."
names(dat) <- gsub('.', '', names(dat), fixed=T)
names(dat)
##  [1] "XOTUID" "JP10"   "JP11"   "JP12"   "JP13"   "JP14"   "JP15"  
##  [8] "JP16"   "JP17"   "JP18"   "JP19"   "JP1"    "JP20"   "JP21"  
## [15] "JP22"   "JP23"   "JP24"   "JP25"   "JP26"   "JP27"   "JP28"  
## [22] "JP29"   "JP2"    "JP30"   "JP31"   "JP32"   "JP3"    "JP4"   
## [29] "JP5"    "JP6"    "JP7"    "JP8"    "JP9"
# load the 'dplyr' package to use functions for data manipulation
library(dplyr)

# extract the read counts into a separate dataframe, calling this 'mat'
# use the 'select' function from the 'dplyr' package
# first argument is the dataframe to use, others refer to columns to keep/lose
mat <- select(dat, -XOTUID)

# convert 'mat' to a matrix object and look at the object structure
mat <- as.matrix(mat)
str(mat)
##  int [1:3504, 1:32] 4990 803 59 2971 1262 441 1532 167 1 107 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:32] "JP10" "JP11" "JP12" "JP13" ...
# use the column of OTU names in 'dat' as row names in 'mat'
# this is okay to do because we have not changed the order of rows or columns in either object
# the '$' indicates that we want to use the column in 'dat' that is named 'XOTUID'
rownames(mat) <- dat$XOTUID
str(mat)
##  int [1:3504, 1:32] 4990 803 59 2971 1262 441 1532 167 1 107 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3504] "ITSall_OTUa_524" "ITSall_OTUa_78" "ITSall_OTUa_8945" "ITSall_OTUc_1" ...
##   ..$ : chr [1:32] "JP10" "JP11" "JP12" "JP13" ...
# transpose the matrix using the 't' function (samples -> rows, OTUs -> columns)
mat <- t(mat)
str(mat)
##  int [1:32, 1:3504] 4990 176 30 89 11 33 405 10358 10611 112 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:32] "JP10" "JP11" "JP12" "JP13" ...
##   ..$ : chr [1:3504] "ITSall_OTUa_524" "ITSall_OTUa_78" "ITSall_OTUa_8945" "ITSall_OTUc_1" ...
# bind the OTU counts in 'mat' to a named column of sample IDs (the rownames of 'mat',
# name this column 'SampleID') and, in the process, replace 'dat' with a new dataframe object
# use the 'stringsAsFactors' argument to keep the sample IDs as character
dat <- data.frame(SampleID=rownames(mat), mat, stringsAsFactors=F)

# don't look at the structure of 'dat' since it contains a few thousand columns
# but you can check the dimensions using the 'dim' function (returns numbers of 
# rows and columns) to make sure that it has only one more column than 'mat'
dim(mat)
## [1]   32 3504
dim(dat)
## [1]   32 3505
# clear 'mat' from the workspace
rm(mat)
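It can also be worth checking per-sample read depths at this point, since uneven sequencing effort will affect later decisions (e.g. whether to rarefy or otherwise normalise). A quick sketch using the combined 'dat' object:

```r
# total reads per sample: sum across all OTU count columns
# (everything except the 'SampleID' column)
sort(rowSums(select(dat, -SampleID)))
```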

Step 2: Read in the metadata and join with the OTU data

The metadata are stored in three .csv files. The first contains the information regarding the sampled material (root or soil) and the plot IDs indicating the plots from which those samples were collected.

The second contains the information about what treatments were applied to the plots and in which block each plot was found.

The third contains several soil chemical and enzymatic measurements that were taken on the same soil samples.

# in each case, use the 'stringsAsFactors' argument to keep text columns as character
# we can convert selected variables to factors later, which is easier

# read in sample information
samps <- read.csv('data/nutnet_samples.csv', stringsAsFactors=F)
str(samps)
## 'data.frame':    32 obs. of  3 variables:
##  $ SampleID  : chr  "JP1" "JP2" "JP3" "JP4" ...
##  $ SampleType: chr  "soil" "soil" "soil" "soil" ...
##  $ PlotID    : int  44 45 40 42 36 29 33 22 32 5 ...
# 'SampleType' can be converted to a factor variable
samps$SampleType <- factor(samps$SampleType)
str(samps)
## 'data.frame':    32 obs. of  3 variables:
##  $ SampleID  : chr  "JP1" "JP2" "JP3" "JP4" ...
##  $ SampleType: Factor w/ 2 levels "root","soil": 2 2 2 2 2 2 2 2 2 2 ...
##  $ PlotID    : int  44 45 40 42 36 29 33 22 32 5 ...
summary(samps)
##    SampleID         SampleType     PlotID     
##  Length:32          root:16    Min.   : 3.00  
##  Class :character   soil:16    1st Qu.:14.00  
##  Mode  :character              Median :25.50  
##                                Mean   :24.88  
##                                3rd Qu.:37.00  
##                                Max.   :45.00
# read in plot information
plots <- read.csv('data/nutnet_plots.csv', stringsAsFactors=F)
str(plots)
## 'data.frame':    16 obs. of  3 variables:
##  $ BlockID  : int  4 4 4 4 3 3 3 3 2 1 ...
##  $ PlotID   : int  44 45 40 42 36 29 33 22 32 5 ...
##  $ Treatment: chr  "NP" "control" "P" "N" ...
# 'BlockID' and 'Treatment' can be converted to factor variables
# use the 'levels' argument to control the ordering of treatments (default is alphabetical)
# don't convert 'PlotID' to a factor variable yet
plots$BlockID <- factor(plots$BlockID)
plots$Treatment <- factor(plots$Treatment, levels=c('control', 'N', 'P', 'NP'))
str(plots)
## 'data.frame':    16 obs. of  3 variables:
##  $ BlockID  : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 3 3 3 3 2 1 ...
##  $ PlotID   : int  44 45 40 42 36 29 33 22 32 5 ...
##  $ Treatment: Factor w/ 4 levels "control","N",..: 4 1 3 2 1 2 3 3 4 3 ...
summary(plots)
##  BlockID     PlotID        Treatment
##  1:4     Min.   : 3.00   control:4  
##  2:4     1st Qu.:14.00   N      :4  
##  3:4     Median :25.50   P      :4  
##  4:4     Mean   :24.88   NP     :4  
##          3rd Qu.:37.00              
##          Max.   :45.00
# read in additional soil data 
soils <- read.csv('data/nutnet_soildata.csv', stringsAsFactors=F)
str(soils)
## 'data.frame':    16 obs. of  19 variables:
##  $ PlotID             : int  3 4 5 8 16 18 21 22 29 32 ...
##  $ Microbial_Biomass_C: num  2.16 2.44 1.89 2.74 2.12 ...
##  $ Microbial_N        : num  0.092 0.118 0.123 0.135 0.142 0.106 0.157 0.126 0.156 0.164 ...
##  $ Microbial_P        : num  0.18 0.061 0.331 0.065 0.169 0.053 0.062 0.184 0.198 0.061 ...
##  $ Total_C_mg_g_soil  : num  119.8 401.5 56.9 422.6 125.9 ...
##  $ Total_N_mg_g_soil  : num  5.09 19.38 3.7 20.87 8.44 ...
##  $ Total_P_mg_g_soil  : num  2508 4228 2848 2943 5718 ...
##  $ X..Moisture        : num  15.9 16.7 15.9 15.7 14.5 ...
##  $ DOC_Soil           : num  0.208 0.083 0.305 0.081 0.267 0.2 0.123 0.247 0.328 0.232 ...
##  $ Extractable_N_Soil : num  0.007 0.011 0.009 0.012 0.011 0.021 0.014 0.011 0.008 0.012 ...
##  $ Extractable_P_Soil : num  0.051 0.007 0.107 0.002 0.061 0.002 0.002 0.052 0.049 0.002 ...
##  $ BG_Enz             : num  144.6 54.7 109 136 117.8 ...
##  $ NAG_Enz            : num  74.4 41.2 36.8 68.5 45.2 ...
##  $ LAP_Enz            : num  27.7 25.6 29.7 30.8 24.8 ...
##  $ N_Enz              : num  102.1 66.8 66.5 99.4 70 ...
##  $ PHOS_Enz           : num  369 284 463 597 411 ...
##  $ CN_Enz             : num  1.417 0.819 1.638 1.369 1.682 ...
##  $ NP_Enz             : num  0.276 0.235 0.144 0.167 0.171 ...
##  $ CP_Enz             : num  0.391 0.193 0.235 0.228 0.287 ...

Having all of these data in different objects isn’t convenient, since we want to make comparisons among the different types of data. To do this, we need to make sure that the data in the different objects are aligned row-by-row. To avoid making mistakes downstream, I prefer to combine all of these data into a single dataframe instead of sorting or indexing objects separately. This is easy to do using the ‘join’ functions from the ‘dplyr’ package, especially if we take care to name the columns used to cross-index between dataframes consistently (although there is a workaround if not). It is much easier than doing this manually, given that some objects have twice as many rows because two sample types were sequenced for each plot.
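As an aside, if the key columns had not been named consistently, the 'by' argument to the join functions provides the workaround. A hypothetical sketch (the column name 'plot_id' is invented for illustration):

```r
# hypothetical: suppose 'plots' called its key 'plot_id' while 'samps' uses 'PlotID'
# the named vector maps the left-hand key to the right-hand key
df <- full_join(samps, plots, by=c('PlotID'='plot_id'))
```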

# we use 'full_join' here because there are no extra rows in our various dataframes 
# but type '?join' in the console to see the many flexible options for joining dataframes

# join the sample and plot information into a single object called 'df'
df <- full_join(samps, plots)
## Joining, by = "PlotID"
summary(df)
##    SampleID         SampleType     PlotID      BlockID   Treatment
##  Length:32          root:16    Min.   : 3.00   1:8     control:8  
##  Class :character   soil:16    1st Qu.:14.00   2:8     N      :8  
##  Mode  :character              Median :25.50   3:8     P      :8  
##                                Mean   :24.88   4:8     NP     :8  
##                                3rd Qu.:37.00                      
##                                Max.   :45.00
# join the soil data to 'df' (note that the soil data will be duplicated across the two 
# 'SampleTypes' since this is unique to each plot, not each sample that was sequenced)
df <- full_join(df, soils)
## Joining, by = "PlotID"
summary(df)
##    SampleID         SampleType     PlotID      BlockID   Treatment
##  Length:32          root:16    Min.   : 3.00   1:8     control:8  
##  Class :character   soil:16    1st Qu.:14.00   2:8     N      :8  
##  Mode  :character              Median :25.50   3:8     P      :8  
##                                Mean   :24.88   4:8     NP     :8  
##                                3rd Qu.:37.00                      
##                                Max.   :45.00                      
##  Microbial_Biomass_C  Microbial_N      Microbial_P      Total_C_mg_g_soil
##  Min.   :1.855       Min.   :0.0920   Min.   :0.05300   Min.   : 56.92   
##  1st Qu.:1.979       1st Qu.:0.1217   1st Qu.:0.06175   1st Qu.:108.47   
##  Median :2.151       Median :0.1490   Median :0.10700   Median :211.53   
##  Mean   :2.187       Mean   :0.1493   Mean   :0.14056   Mean   :239.70   
##  3rd Qu.:2.352       3rd Qu.:0.1653   3rd Qu.:0.18750   3rd Qu.:395.49   
##  Max.   :2.743       Max.   :0.2570   Max.   :0.34900   Max.   :439.34   
##  Total_N_mg_g_soil Total_P_mg_g_soil  X..Moisture       DOC_Soil     
##  Min.   : 3.700    Min.   : 2508     Min.   :13.55   Min.   :0.0810  
##  1st Qu.: 7.725    1st Qu.: 3591     1st Qu.:14.48   1st Qu.:0.2015  
##  Median :13.910    Median : 4363     Median :15.14   Median :0.2305  
##  Mean   :15.695    Mean   : 5294     Mean   :15.12   Mean   :0.2173  
##  3rd Qu.:24.140    3rd Qu.: 6565     3rd Qu.:15.73   3rd Qu.:0.2530  
##  Max.   :31.910    Max.   :12651     Max.   :16.71   Max.   :0.3280  
##  Extractable_N_Soil Extractable_P_Soil     BG_Enz          NAG_Enz      
##  Min.   :0.00600    Min.   :0.00200    Min.   : 54.74   Min.   : 32.66  
##  1st Qu.:0.00775    1st Qu.:0.00200    1st Qu.:104.09   1st Qu.: 36.38  
##  Median :0.01100    Median :0.01950    Median :120.08   Median : 43.23  
##  Mean   :0.01056    Mean   :0.03269    Mean   :123.19   Mean   : 55.72  
##  3rd Qu.:0.01200    3rd Qu.:0.05125    3rd Qu.:137.69   3rd Qu.: 58.54  
##  Max.   :0.02100    Max.   :0.10700    Max.   :213.97   Max.   :130.16  
##     LAP_Enz          N_Enz           PHOS_Enz         CN_Enz      
##  Min.   :23.31   Min.   : 56.97   Min.   :284.1   Min.   :0.8193  
##  1st Qu.:24.75   1st Qu.: 65.45   1st Qu.:346.1   1st Qu.:1.3609  
##  Median :27.31   Median : 68.71   Median :419.1   Median :1.5719  
##  Mean   :27.10   Mean   : 82.82   Mean   :462.2   Mean   :1.5872  
##  3rd Qu.:29.28   3rd Qu.: 88.01   3rd Qu.:548.7   3rd Qu.:1.7357  
##  Max.   :30.84   Max.   :157.01   Max.   :902.9   Max.   :3.1758  
##      NP_Enz           CP_Enz      
##  Min.   :0.1172   Min.   :0.1461  
##  1st Qu.:0.1453   1st Qu.:0.2236  
##  Median :0.1691   Median :0.2667  
##  Mean   :0.1901   Mean   :0.2792  
##  3rd Qu.:0.1945   3rd Qu.:0.3419  
##  Max.   :0.4572   Max.   :0.4156
# join the OTU data to 'df', but don't look at a summary of the whole dataframe
# since this will be very large -- instead we can use the square brackets ('[]')
# to look at the structure of the dataframe based on the first few columns
df <- full_join(df, dat)
## Joining, by = "SampleID"
str(df[, 1:25])
## 'data.frame':    32 obs. of  25 variables:
##  $ SampleID           : chr  "JP1" "JP2" "JP3" "JP4" ...
##  $ SampleType         : Factor w/ 2 levels "root","soil": 2 2 2 2 2 2 2 2 2 2 ...
##  $ PlotID             : int  44 45 40 42 36 29 33 22 32 5 ...
##  $ BlockID            : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 3 3 3 3 2 1 ...
##  $ Treatment          : Factor w/ 4 levels "control","N",..: 4 1 3 2 1 2 3 3 4 3 ...
##  $ Microbial_Biomass_C: num  1.99 2.12 2.4 2.19 2.15 ...
##  $ Microbial_N        : num  0.165 0.194 0.257 0.166 0.174 0.156 0.114 0.126 0.164 0.123 ...
##  $ Microbial_P        : num  0.069 0.071 0.349 0.198 0.055 0.198 0.143 0.184 0.061 0.331 ...
##  $ Total_C_mg_g_soil  : num  285.7 297.2 68.7 110.9 393.5 ...
##  $ Total_N_mg_g_soil  : num  23.68 27.26 7.35 8.42 31.91 ...
##  $ Total_P_mg_g_soil  : num  7322 3622 4814 7371 6313 ...
##  $ X..Moisture        : num  14.2 13.6 14.2 15 14.6 ...
##  $ DOC_Soil           : num  0.259 0.229 0.251 0.235 0.227 0.328 0.202 0.247 0.232 0.305 ...
##  $ Extractable_N_Soil : num  0.007 0.014 0.009 0.011 0.006 0.008 0.006 0.011 0.012 0.009 ...
##  $ Extractable_P_Soil : num  0.002 0.002 0.101 0.049 0.002 0.049 0.032 0.052 0.002 0.107 ...
##  $ BG_Enz             : num  105.3 100.5 142.7 131.9 99.2 ...
##  $ NAG_Enz            : num  32.7 39.3 130.2 121.5 35.1 ...
##  $ LAP_Enz            : num  24.3 27.5 26.8 29.9 23.3 ...
##  $ N_Enz              : num  57 66.7 157 151.4 58.4 ...
##  $ PHOS_Enz           : num  294 477 343 903 333 ...
##  $ CN_Enz             : num  1.848 1.505 0.909 0.871 1.698 ...
##  $ NP_Enz             : num  0.194 0.14 0.457 0.168 0.175 ...
##  $ CP_Enz             : num  0.358 0.211 0.416 0.146 0.298 ...
##  $ ITSall_OTUa_524    : int  170 10 0 683 1172 20 33 2701 44 4990 ...
##  $ ITSall_OTUa_78     : int  7 1091 1097 594 101 349 248 329 19 803 ...
# now that we have joined the data, we can convert 'PlotID' to factor
df$PlotID <- factor(df$PlotID)

# keep the workspace tidy by removing unnecessary objects
rm(dat, plots, samps, soils)

It is easy to extract certain types of variables later on when necessary because I have been careful in naming my variables (more on this later). It also enhances reproducibility because I am always starting from the same dataframe.

Step 3: Read in taxonomic annotations

The blast results are stored in a tab-delimited text file, seen below.

The first two commands in the code below read the file into R and rename the first column. The third command keeps only the first row of the result for each OTU; this is the result with the best e-value and bitscore. You may have other criteria for selecting the most appropriate taxonomic annotation, which would require some modifications to this code.

# read in the data, use 'stringsAsFactors=F' to keep text as 'character'
tax <- read.delim('data/nutnet_tax.txt', stringsAsFactors=F)

# rename the first column
names(tax)[1] <- 'OTUId'

# keep only the first row of each OTU
tax <- tax[!duplicated(tax$OTUId), ]
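If you prefer to select rows by an explicit criterion rather than relying on file order, a dplyr sketch along these lines may be useful (this assumes your blast table contains a numeric 'bitscore' column; the name may differ in your output, and 'slice_max' requires dplyr >= 1.0.0):

```r
# keep, for each OTU, the single row with the highest bitscore
# 'bitscore' is an assumed column name -- check your blast output
tax <- tax %>%
  group_by(OTUId) %>%
  slice_max(bitscore, n=1, with_ties=FALSE) %>%
  ungroup() %>%
  as.data.frame()
```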

Your taxonomic annotations may not yet be parsed into the different taxonomic levels; this can be done using the following code.

# split the 'taxonomy' column in 'tax' by the delimiter
temp <- strsplit(tax$taxonomy, ';')

# collapse into a matrix, naming each column
temp <- do.call('rbind', temp)
colnames(temp) <- c('kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species')

# any kingdom-to-genus unclassified, change to NA
temp[temp %in% grep('unclassified', temp, value=T)] <- NA

# any species unclassified, change to NA
temp[temp %in% grep('_sp$', temp, value=T)] <- NA

# clean up characters at the start of each remaining annotation
temp <- gsub('^[a-z]\\_\\_', '', temp)

# add new columns onto 'tax' data.frame
tax <- cbind(tax, temp, stringsAsFactors=F)

# clean up workspace
rm(temp)

Inspecting the data

The phyloseq library contains several helper functions for generating many of the summary plots that we might like to use to inspect the data. The first step is to split the data into its various components and then merge them into a phyloseq object.

# load library
library(phyloseq)

# extract the OTU data from 'df' and convert to matrix format
# using 'starts_with' with 'select' allows us to choose all columns containing OTU counts
otus <- as.matrix(select(df, starts_with('ITSall')))
# rows need to be named with the sample names from 'df', 
# which we can do directly because they are in the same order
rownames(otus) <- df$SampleID

# extract the sample data from from 'df' and keep as dataframe format (mix of character and numeric variables)
samps <- select(df, SampleID:CP_Enz)
# rows need to be named with the sample names from 'df', 
# which we can do directly because they are in the same order
rownames(samps) <- df$SampleID

# extract the taxonomy info for each OTU from 'tax' and convert to matrix format
taxonomy <- as.matrix(select(tax, kingdom:species))
# rows need to be named with the OTU names from 'tax', 
# which we can do directly because they are in the same order
rownames(taxonomy) <- tax$OTUId

# merge the three objects into a single phyloseq object using 'merge_phyloseq'
# each function nested within the call to 'merge_phyloseq' creates a special object for that type of data
# because of how the OTU matrix is oriented, we have to specify 'taxa_are_rows=F' 
phy <- merge_phyloseq(otu_table(otus, taxa_are_rows=F), 
                      sample_data(samps), tax_table(taxonomy))

# remove extra objects to keep workspace tidy
rm(otus, samps, taxonomy)
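At this point it is worth confirming that the merge behaved as expected; phyloseq provides accessor functions for this:

```r
# print a summary of the phyloseq object
phy

# numbers of taxa and samples, and the taxonomic ranks available
ntaxa(phy)
nsamples(phy)
rank_names(phy)
```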

We can produce some barplots to look at the distribution of taxa in our data. plot_bar works with a phyloseq object, and we can do further modification of the plot with ggplot2 functions.

# load library
library(ggplot2)

# bar plot of phylum by treatment
plot_bar(phy, x='Treatment', fill='phylum') + 
        # create separate plots for each sample type
        facet_wrap(~SampleType) + 
        # colour the segment outlines by phylum as well, which hides the
        # black lines that plot_bar draws around each individual OTU
        geom_bar(aes(color=phylum), stat='identity', position='stack')

There are a lot of OTUs that have not been classified at the phylum level, and we may want to exclude these. Others have been classified as nonfungal taxa, and we will definitely want to exclude those. The code below overwrites the existing phyloseq object after removing the OTUs that are nonfungal or unclassified at the phylum level.

# keep only fungal OTUs with at least phylum-level classification 
phy <- subset_taxa(phy, kingdom=='Fungi' & !is.na(phylum))

# bar plot of phylum by treatment
plot_bar(phy, x='Treatment', fill='phylum') + 
        # create separate plots for each sample type
        facet_wrap(~SampleType) + 
        # colour the segment outlines by phylum as well, which hides the
        # black lines that plot_bar draws around each individual OTU
        geom_bar(aes(color=phylum), stat='identity', position='stack')

Note: if your annotations say unclassified <taxon> instead of NA, the following should work:

# keep only fungal OTUs with at least phylum-level classification 
# the 'phylum %in% grep' call returns 'TRUE' if unclassified, and
# the '!' forces it to return the opposite ('FALSE')
phy <- subset_taxa(phy, kingdom == 'Fungi' & 
                     !phylum %in% grep('unclassified', phylum, value=T))

Also note that you can modify this code to only keep certain taxa if you want to restrict your analysis to a narrow subset.
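For example, to restrict the analysis to a single phylum, something like the following should work (swap in whichever taxon interests you):

```r
# keep only OTUs classified as Ascomycota
phy_asco <- subset_taxa(phy, phylum == 'Ascomycota')
```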

Let’s modify the plot to represent relative percentages instead of OTU counts. We include the scales library to customise the y-axis.
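One common approach is to convert the counts to within-sample proportions with 'transform_sample_counts' and format the y-axis with 'scales::percent'; a sketch (the exact transformation used in the workshop may differ):

```r
# load library for axis formatting
library(scales)

# convert counts to relative abundances within each sample
phy_rel <- transform_sample_counts(phy, function(x) x / sum(x))

# bar plot of phylum relative abundance by treatment
# note: each bar stacks several samples, so bar totals can exceed 100%
plot_bar(phy_rel, x='Treatment', fill='phylum') + 
        facet_wrap(~SampleType) + 
        geom_bar(aes(color=phylum), stat='identity', position='stack') + 
        scale_y_continuous(labels=percent)
```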

We can also produce a krona plot that we can open in a web browser and will be more convenient to inspect. The code below produces a krona plot for each level of SampleType and saves the result in an html file in the working directory. The function requires installation of kronaTools; download source here (Clone or download button) and follow instructions here (requires perl).

I have posted a version of this result here.

# load library
library(psadd)

# generate output file (html) containing krona plot
plot_krona(phy, output='nutnet_ITS', variable='SampleType')

We can also produce a network graph to visualise relationships among samples. This dataset has very high OTU turnover among samples, so we have to set a very high max.dist to include all samples.

# make the network object
net <- make_network(phy, 'samples', max.dist=0.9)

# plot the network object, setting symbol colour and shape
plot_network(net, phy, color='SampleType', shape='Treatment')
## Warning: attributes are not identical across measure variables; they will
## be dropped

Finally, let’s plot an ordination to visualise relationships in multivariate space. More on the specific methods used in this ordination below.

# plot the ordination using multidimensional scaling of Bray-Curtis distances 
plot_ordination(phy, ordinate(phy, method='PCoA', distance='bray'), 
                color='SampleType', shape='Treatment', label='SampleID')

# plot a scree plot showing variation explained by each axis
plot_ordination(phy, ordinate(phy, method='PCoA', distance='bray'), 
                type='scree')